Statistical Mechanical Development of a Sparse Bayesian Classifier

The demand for extracting rules from high dimensional real world data is increasing in
various fields. However, the possible redundancy of such data sometimes makes it difficult to 
obtain a good generalization ability for novel samples. To resolve this problem, we provide 
a scheme that reduces the effective dimensions of data by pruning redundant components 
for bicategorical classification based on the Bayesian framework. First, the potential of the 
proposed method is confirmed in ideal situations using the replica method. Unfortunately, 
performing the scheme exactly is computationally difficult. So, we next develop a tractable 
approximation algorithm, which turns out to offer nearly optimal performance in ideal cases 
when the system size is large. Finally, the efficacy of the developed classifier is experimentally
examined for a real world problem of colon cancer classification, which shows that the
developed method can be practically useful. 

KEYWORDS: Bayes prediction, Belief propagation, Classification, Disordered system, Replica 
analysis 



1. Introduction 

In recent years, the demand for methods to extract rules from high dimensional data has been
increasing in the research fields of machine learning and artificial intelligence, in particular,
those concerning bioinformatics. One of the most elemental and important problems of rule
extraction is bicategorical classification based on a given data set [1]. In a general scenario, the
purpose of this task is to extract a certain relation between the input $\boldsymbol{x} \in \mathbb{R}^N$, which is a
high dimensional vector, and the binary output $y \in \{+1, -1\}$, which represents a categorical
label, from a training data set $D^M = \{(\boldsymbol{x}^1, y^1), \ldots, (\boldsymbol{x}^\mu, y^\mu), \ldots, (\boldsymbol{x}^M, y^M)\}$ of $M (= 1, 2, \ldots)$
examples.

The Bayesian framework offers a useful guideline for this task. Let us assume that the
relation can be represented by a probabilistic model defined by a conditional probability
$P(y|\boldsymbol{x},\boldsymbol{w})$, where $\boldsymbol{w}$ stands for a set of adjustable parameters of the model. Under this
assumption, it can be shown that for an input $\boldsymbol{x}^{M+1}$, the Bayesian classification

$$ y^{M+1} = \mathop{\rm argmax}_{y^{M+1}} P(y^{M+1}|\boldsymbol{x}^{M+1}, D^M) \tag{1} $$

minimizes the probability of misclassification after the training set $D^M$ is observed [2]. Here
$P(y^{M+1}|\boldsymbol{x}^{M+1},D^M) = \int d\boldsymbol{w}\, P(y^{M+1}|\boldsymbol{x}^{M+1},\boldsymbol{w})P(\boldsymbol{w}|D^M)$ is termed the predictive probability,
and the posterior distribution $P(\boldsymbol{w}|D^M)$ is represented by the Bayes formula

$$ P(\boldsymbol{w}|D^M) = \frac{P(\boldsymbol{w})\prod_{\mu=1}^M P(y^\mu|\boldsymbol{x}^\mu,\boldsymbol{w})}{\int d\boldsymbol{w}\, P(\boldsymbol{w})\prod_{\mu=1}^M P(y^\mu|\boldsymbol{x}^\mu,\boldsymbol{w})} \tag{2} $$

using the training data $D^M$ and a certain prior distribution $P(\boldsymbol{w})$.

However, even this approach sometimes does not provide a satisfactory result for real world 
problems. A major cause of difficulty is the redundancy that exists in real world data. For 
example, let us consider a classification problem of DNA microarray data, which is a standard 
problem of bioinformatics. In such problems, while the size of available data sets is less than 
one hundred, each piece of data is typically composed of several thousand components, the 
causality or relation amongst which is not known in advance [3]. Simple methods that handle
all the components usually overfit the training data, which results in quite a low classification
performance for novel samples even when the Bayesian scheme of eq. (1) is performed.
Therefore, when dealing with real world data, not only is the classification scheme itself very 
important but it is also important to reduce the effective dimensions, assessing the relevance 
of each component. 

The purpose of this paper is to develop a scheme to improve the performance of the 
Bayesian classifier, introducing a mechanism for eliminating irrelevant components. The idea of 
the method is simple: in order to assess the relevance of each component of data, we introduce 
a discrete pruning parameter $c_l \in \{0, 1\}$ for each component $x_l$, and classify $(c_l x_l)$ instead of
$\boldsymbol{x} = (x_l)$ itself. Components to which $c_l = 0$ is assigned are ignored in the classification.
Assuming an appropriate prior for $\boldsymbol{c} = (c_l)$ that controls the number of ignored components,
we can introduce a mechanism to reduce the effective dimensions in the Bayesian classification 
(eq. (1)), which is expected to lead to an improvement of the classification performance. 
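The pruning idea can be illustrated with a minimal sketch. The function names, the toy numbers and the deterministic sign rule are assumptions made for illustration only; the $1/\sqrt{N}$ scaling follows the perceptron-type model that the paper introduces in the next section, and the actual classifier averages over the posterior rather than using a single $(\boldsymbol{w}, \boldsymbol{c})$.

```python
import math

def pruned_stability(x, w, c):
    """Return (1/sqrt(N)) * sum_l c_l * w_l * x_l for a binary mask c."""
    n = len(x)
    return sum(cl * wl * xl for cl, wl, xl in zip(c, w, x)) / math.sqrt(n)

def classify(x, w, c):
    """Illustrative deterministic rule: sign of the pruned stability."""
    return 1 if pruned_stability(x, w, c) >= 0 else -1

# Toy example (hypothetical values): components 2 and 4 are pruned.
x = [1.0, -2.0, 0.5, 3.0]
w = [0.5, 1.0, 1.0, -2.0]
c_pruned = [1, 0, 1, 0]
c_full = [1, 1, 1, 1]
```

With these toy values the pruned and unpruned classifiers disagree, which shows how components with $c_l = 0$ drop out of the decision entirely.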

In the literature, such pruning parameters have already been proposed in the research of
perceptron learning [4] and linear regression problems [5]. However, as far as the authors know, the
potential of this method has not been fully examined, nor have practically tractable algorithms
been proposed for evaluating the Bayesian classification of eq. (1), which is computationally
difficult in general. We will show that our scheme offers optimal performance in ideal situations
and we provide a tractable algorithm that achieves a nearly optimal performance in such
cases. For simplicity, we will here focus on a classifier of linear separation type; however, the
developed method can be extended to non-linear classifiers, such as those based on the kernel
method [6].

This paper is organized as follows. In the next section, section 2, we present details of the 
classifier we are focusing on. In section 3, the performance of the classifier is evaluated by the 
replica method to clarify the potential of the proposed strategy. We show that the scheme 
minimizes the probability of misclassification for novel data when the data includes redundant
components in a certain manner. However, performing the scheme exactly is computationally 
difficult. In order to resolve this difficulty, we develop a tractable algorithm in section 4. We 
show analytically that a nearly optimal performance, predicted by replica analysis, is obtained 
by the algorithm in ideal cases. In section 5, the efficacy for a real world problem of colon 
cancer classification is examined, demonstrating that the developed scheme is competitive 
practically. The final section, section 6, is devoted to a summary. 

2. Sparse Bayesian Classifier 

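The classifier that we will focus on is provided by a conditional probability of perceptron
type [7]

$$ P(y|\boldsymbol{x},\boldsymbol{w},\boldsymbol{c}) = f\left(\frac{y}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l\right), \tag{3} $$

where $c_l \in \{0, 1\}$ is the pruning parameter and the activation function $f(u)$ satisfies $f(u) \ge 0$
and $f(u) + f(-u) = 1$ for $\forall u \in \mathbb{R}$. To introduce the pruning effect, we also use

$$ P(\boldsymbol{w},\boldsymbol{c}) = \mathcal{V}^{-1} e^{-\frac{1}{2}\sum_{l=1}^N (1-c_l) w_l^2}\, \delta\left(\sum_{l=1}^N c_l - NC\right) \delta\left(\sum_{l=1}^N c_l w_l^2 - NC\right) \tag{4} $$

as the prior probability of the parameters $\boldsymbol{w}$ and $\boldsymbol{c}$, where $0 < C < 1$ is a hyper parameter that
controls the ratio of the effective dimensions. The factor $e^{-\frac{1}{2}\sum_{l=1}^N (1-c_l) w_l^2}$ is included to make the
normalization constant $\mathcal{V} = \int d\boldsymbol{w} \sum_{\boldsymbol{c}} e^{-\frac{1}{2}\sum_{l=1}^N (1-c_l) w_l^2}\, \delta\left(\sum_{l=1}^N c_l - NC\right) \delta\left(\sum_{l=1}^N c_l w_l^2 - NC\right)$
finite. As this microcanonical prior enforces the probability to vanish unless the pruning
parameters $c_l \in \{0, 1\}$ satisfy the constraint $\sum_{l=1}^N c_l = NC$, each specific parameter choice
ignores certain $N(1 - C)$ components of $\boldsymbol{x}$ for classification. These yield the posterior
distribution $P(\boldsymbol{w}, \boldsymbol{c}|D^M)$ via the Bayes formula

$$ P(\boldsymbol{w},\boldsymbol{c}|D^M) = \frac{P(\boldsymbol{w},\boldsymbol{c})\prod_{\mu=1}^M P(y^\mu|\boldsymbol{x}^\mu,\boldsymbol{w},\boldsymbol{c})}{\int d\boldsymbol{w} \sum_{\boldsymbol{c}} P(\boldsymbol{w},\boldsymbol{c})\prod_{\mu=1}^M P(y^\mu|\boldsymbol{x}^\mu,\boldsymbol{w},\boldsymbol{c})}, \tag{5} $$

which defines the Bayesian classifier as

$$ y^{M+1} = \mathop{\rm argmax}_{y^{M+1}} \int d\boldsymbol{w} \sum_{\boldsymbol{c}} f\left(\frac{y^{M+1}}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l^{M+1}\right) P(\boldsymbol{w},\boldsymbol{c}|D^M). \tag{6} $$

The pruning vector $\boldsymbol{c}$ eliminates irrelevant dimensions of data, which makes $\boldsymbol{x}$ sparse. Therefore,
we term the classification scheme represented in eq. (6) the sparse Bayesian classifier
(SBC).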
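As a concrete instance of the conditional probability of eq. (3), the following sketch uses the noisy threshold activation $\kappa + (1-2\kappa)\Theta(u)$ that the paper adopts later as eq. (14). The function names and the default $\kappa = 0.05$ are illustrative assumptions; with the convention $\Theta(0) = 0$ the identity $f(u) + f(-u) = 1$ holds for every $u \neq 0$.

```python
import math

def f_threshold(u, kappa=0.05):
    """Noisy threshold activation: kappa + (1 - 2*kappa) * Theta(u)."""
    step = 1.0 if u > 0 else 0.0
    return kappa + (1.0 - 2.0 * kappa) * step

def cond_prob(y, x, w, c, kappa=0.05):
    """P(y | x, w, c) = f( y/sqrt(N) * sum_l c_l w_l x_l )."""
    n = len(x)
    field = sum(cl * wl * xl for cl, wl, xl in zip(c, w, x)) / math.sqrt(n)
    return f_threshold(y * field, kappa)

# Hypothetical toy input: the two label probabilities sum to one.
x = [1.0, -2.0, 0.5, 3.0]
w = [0.5, 1.0, 1.0, -2.0]
c = [1, 0, 1, 0]
p_plus = cond_prob(+1, x, w, c)
p_minus = cond_prob(-1, x, w, c)
```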

3. Replica analysis 

To evaluate the ability of the SBC, let us assume the following teacher-student scenario.
In this scenario, a "teacher" classifier is selected from a certain distribution $P_t(\boldsymbol{w}_0)$. For each
of $M$ inputs which are independently generated from an identical distribution
$P_{\rm in}(\boldsymbol{x})$, the teacher provides a classification label $y = \pm 1$ following the conditional probability
$P(y|\boldsymbol{x},\boldsymbol{w}_0) = f\left(\frac{y}{\sqrt{N}}\sum_{l=1}^N w_{0l} x_l\right)$, which constitutes the training data set $D^M$. Then, the
performance of the SBC, which plays the role of "student" in this scenario, can be measured
by the generalization error, which is defined as the probability of misclassification for a test
input generated from $P_{\rm in}(\boldsymbol{x})$.

To represent a situation where certain dimensions are not relevant for the classification
label, let us assume that the teacher distribution is provided as

$$ P_t(\boldsymbol{w}_0) = \prod_{l=1}^N \left[(1-C_t)\,\delta(w_{0l}) + C_t \frac{e^{-w_{0l}^2/2}}{\sqrt{2\pi}}\right], \tag{7} $$

where $C_t$ is the ratio of the relevant dimensions, which is not known to the student. For
simplicity, we further assume that the inputs are generated from a spherical distribution
$P_{\rm in}(\boldsymbol{x}) = P_{\rm sph}(\boldsymbol{x}) \propto \delta(|\boldsymbol{x}|^2 - N)$, which guarantees that the correlation of $c_l w_l$ between
different components is sufficiently weak when the parameters $\boldsymbol{w}$ and $\boldsymbol{c}$ are generated from the
posterior $P(\boldsymbol{w},\boldsymbol{c}|D^M)$. As a hard constraint, $\sum_{l=1}^N c_l w_l^2 = \sum_{l=1}^N (c_l w_l)^2 = NC$ is introduced
by the microcanonical prior of eq. (4). This implies that one can approximate the stability
$\Delta = \frac{1}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l$ as

$$ \Delta \simeq \sqrt{C-Q}\,u + \langle\Delta\rangle, \tag{8} $$

using a Gaussian random variable $u \sim \mathcal{N}(0, 1)$, where $\langle\cdots\rangle = \int d\boldsymbol{w} \sum_{\boldsymbol{c}} P(\boldsymbol{w},\boldsymbol{c}|D^M)(\cdots)$ and
$Q = \frac{1}{N}\sum_{l=1}^N \langle c_l w_l\rangle^2$. This makes it possible to evaluate eq. (6) as

$$ y^{M+1} = {\rm sign}\left(\int Du\, f\left(\sqrt{C-Q}\,u + \langle\Delta\rangle\right) - \frac{1}{2}\right), \tag{9} $$

where ${\rm sign}(x) = x/|x|$ for $x \ne 0$, and $Du = \frac{du}{\sqrt{2\pi}} e^{-u^2/2}$. Further, one can also deal with the average
stability $\langle\Delta\rangle$ and the teacher's stability $\Delta^0 = \frac{1}{\sqrt{N}}\sum_{l=1}^N w_{0l} x_l$ as Gaussian random variables,
the variances and covariance of which are given as

$$ \overline{(\Delta^0)^2} = C_t, \qquad \overline{\Delta^0 \langle\Delta\rangle} = R, \qquad \overline{\langle\Delta\rangle^2} = Q, \tag{10} $$

using $\overline{x_l x_j} = \delta_{lj}$, where $\overline{\cdots} = \int d\boldsymbol{x}\, P_{\rm sph}(\boldsymbol{x})(\cdots)$ and $R = \frac{1}{N}\sum_{l=1}^N w_{0l} \langle c_l w_l\rangle$. This, in conjunction
with the symmetry of $y = \pm 1$ in the current system, indicates that the generalization
error of the SBC can be evaluated as

$$ \epsilon_g^{\rm SBC} = 2\int Dz \left(1 - \int Dv\, f\left(\sqrt{C_t - \frac{R^2}{Q}}\,v + \frac{R}{\sqrt{Q}}\,z\right)\right) \Theta\left(\int Du\, f\left(\sqrt{C-Q}\,u + \sqrt{Q}\,z\right) - \frac{1}{2}\right), \tag{11} $$

where $\Theta(x) = 1$ for $x > 0$ and $0$ otherwise, using the macroscopic variables $R$ and $Q$, which
can be assessed by the replica method [8, 9].
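The teacher-student scenario above can be simulated directly. The sketch below is a minimal illustration under stated assumptions: the relevant teacher components are drawn from a unit-variance Gaussian (consistent with the variance $C_t$ of the teacher's stability), the teacher's labels follow the noisy threshold activation $\kappa + (1-2\kappa)\Theta(u)$ used later in the paper, and all function names are hypothetical.

```python
import math
import random

def sample_teacher(n, c_t, rng):
    """Each w_0l is zero with probability 1 - c_t, otherwise N(0, 1)."""
    return [rng.gauss(0.0, 1.0) if rng.random() < c_t else 0.0
            for _ in range(n)]

def sample_spherical(n, rng):
    """Uniform sample on the sphere |x|^2 = n (normalized Gaussian vector)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(n) / math.sqrt(sum(v * v for v in g))
    return [v * scale for v in g]

def teacher_label(x, w0, kappa, rng):
    """Draw y = +1 with probability f(Delta^0), f(u) = kappa + (1-2kappa)Theta(u)."""
    n = len(x)
    field = sum(a * b for a, b in zip(w0, x)) / math.sqrt(n)
    p_plus = kappa + (1.0 - 2.0 * kappa) * (1.0 if field > 0 else 0.0)
    return 1 if rng.random() < p_plus else -1

def make_training_set(n, m, c_t, kappa, seed=0):
    """Return (w0, [(x^mu, y^mu)]) for M = m examples in N = n dimensions."""
    rng = random.Random(seed)
    w0 = sample_teacher(n, c_t, rng)
    data = []
    for _ in range(m):
        x = sample_spherical(n, rng)
        data.append((x, teacher_label(x, w0, kappa, rng)))
    return w0, data
```

Such synthetic data correspond to the ideal situation analyzed by the replica method, with $\alpha = M/N$ fixed by the choice of `m` and `n`.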
Fig. 1. Generalization error versus $C$ for (a) the SBC and (b) Gibbs learning ($\alpha = 1$, $\kappa = 0.05$).

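To assess $R$ and $Q$, we evaluate the average of the $n (= 1, 2, \ldots)$-th power of the partition
function

$$ Z(D^M) = \int d\boldsymbol{w} \sum_{\boldsymbol{c}} P(\boldsymbol{w},\boldsymbol{c}) \prod_{\mu=1}^M P(y^\mu|\boldsymbol{w},\boldsymbol{c},\boldsymbol{x}^\mu) $$
$$ \phantom{Z(D^M)} = \mathcal{V}^{-1} \int d\boldsymbol{w} \sum_{\boldsymbol{c}} \prod_{\mu=1}^M f\left(\frac{y^\mu}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l^\mu\right) e^{-\frac{1}{2}\sum_{l=1}^N (1-c_l) w_l^2}\, \delta\left(\sum_{l=1}^N c_l w_l^2 - CN\right) \delta\left(\sum_{l=1}^N c_l - CN\right) \tag{12} $$

with respect to the training data set $D^M$. The analytical continuation from $n = 1, 2, \ldots$ to
$n \in \mathbb{R}$ under the replica symmetric (RS) ansatz provides the expression for the RS free energy:

$$ \lim_{n \to 0} \frac{1}{N} \frac{\partial}{\partial n} \ln \left[Z^n(D^M)\right]_{D^M, \boldsymbol{w}_0} = \mathop{\rm Ext}_{R,Q,\hat{R},\hat{Q},F,\lambda} \left\{ 2\alpha \int Dz \int Dv\, f\left(\sqrt{C_t - \frac{R^2}{Q}}\,v + \frac{R}{\sqrt{Q}}\,z\right) \ln \int Du\, f\left(\sqrt{C-Q}\,u + \sqrt{Q}\,z\right) \right. $$
$$ \left. - R\hat{R} + \frac{1}{2} Q\hat{Q} + \frac{1}{2} FC - \lambda C + \left\langle \int Dz \ln \left[ 1 + \frac{1}{\sqrt{F+\hat{Q}}} \exp\left( \lambda + \frac{(\sqrt{\hat{Q}}\,z + \hat{R} w_0)^2}{2(F+\hat{Q})} \right) \right] \right\rangle_{w_0} \right\} - \frac{1}{N} \ln \mathcal{V}, \tag{13} $$

where $[\cdots]_{D^M, \boldsymbol{w}_0}$ denotes the average over the training set $D^M$ and the teacher distribution
of eq. (7), ${\rm Ext}_x(\cdots)$ denotes extremization of $\cdots$ with respect to $x$, which determines $R$ and
$Q$, and $\alpha = M/N$ and $\langle\cdots\rangle_{w_0} = \int dw_0 \left( (1-C_t)\,\delta(w_0) + C_t \frac{e^{-w_0^2/2}}{\sqrt{2\pi}} \right)(\cdots)$.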



The generalization error $\epsilon_g^{\rm SBC}$ can be evaluated from eq. (11) using $R$ and $Q$ obtained via
the extremization problem of eq. (13), which is plotted in fig. 1 (a) as a function of the hyper
parameter $C$ in the case of

$$ f(u) = f(u; \kappa) = \kappa + (1 - 2\kappa)\Theta(u), \tag{14} $$

where $\kappa (< 1/2)$ is a non-negative constant. This figure shows that the developed scheme
improves the classification ability for the assumed teacher model of eq. (7) when the hyper
parameter $C$ is appropriately adjusted. Actually, $\epsilon_g^{\rm SBC}$ is minimized when the hyper parameter is
set to the teacher's value, $C = C_t$, independent of the specific choice of the activation function
$f(u)$. This is because the microcanonical prior of eq. (4) practically coincides with the teacher
model of eq. (7) when $C = C_t$ and, therefore, the predictive probability $P(y^{M+1}|\boldsymbol{x}^{M+1}, D^M)$
can be correctly evaluated in such cases. That the classification based on the correct predictive
probability provides the best performance among all the possible strategies [2] implies that the
proposed scheme is optimal for the assumed teacher model if the hyper parameter is correctly
tuned.
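For the activation function of eq. (14) the Gaussian integrals in eq. (11) can be carried out by hand. Because $\kappa < 1/2$, the student's decision reduces to the sign of $z$, and the error depends on $R$ and $Q$ only through the normalized overlap $\rho = R/\sqrt{Q C_t}$, giving $\epsilon_g = \kappa + (1-2\kappa)\arccos(\rho)/\pi$. The sketch below states this closed form and cross-checks it against a direct midpoint-rule quadrature of eq. (11); the function names, the symbol $\rho$ and the numerical values of $R$, $Q$ and $C_t$ are illustrative assumptions, not solutions of the saddle-point problem of eq. (13).

```python
import math

def eps_g_closed_form(R, Q, C_t, kappa):
    """kappa + (1 - 2 kappa) * arccos(rho) / pi with rho = R / sqrt(Q * C_t)."""
    rho = R / math.sqrt(Q * C_t)
    rho = max(-1.0, min(1.0, rho))  # guard against rounding just outside [-1, 1]
    return kappa + (1.0 - 2.0 * kappa) * math.acos(rho) / math.pi

def _std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def eps_g_quadrature(R, Q, C_t, kappa, steps=20000, z_max=8.0):
    """Midpoint rule for eq. (11); needs C_t - R^2/Q > 0 and kappa < 1/2."""
    s = math.sqrt(C_t - R * R / Q)
    dz = z_max / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * dz  # only z > 0 contributes: the student answers +1 there
        teacher_plus = kappa + (1.0 - 2.0 * kappa) * _std_normal_cdf(
            R * z / (math.sqrt(Q) * s))
        gauss = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += (1.0 - teacher_plus) * gauss * dz
    return 2.0 * total

# Hypothetical overlap values, used only to compare the two evaluations.
closed = eps_g_closed_form(0.1, 0.15, 0.2, 0.05)
numeric = eps_g_quadrature(0.1, 0.15, 0.2, 0.05)
```

The closed form makes the two limits explicit: $\rho = 0$ gives chance level $1/2$, while $\rho = 1$ leaves only the label noise $\kappa$.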

Two things are worth discussing further. Firstly, the simplest replica symmetry was
assumed in the above analysis, the validity of which should be examined. In fact, the RS ansatz
can be broken for sufficiently small $C$, for which certain replica symmetry breaking (RSB)
analysis [9] is required. However, at the optimal choice of hyper parameter $C = C_t$, the analysis
can be considered to be correct because this choice of prior corresponds to the Nishimori
condition known in spin-glass research, at which no RSB is expected to occur [10]. Therefore,
we do not perform RSB analysis in this paper. Secondly, the above analysis implies that
minimization of the generalization error can be used to estimate hyper parameters for the
SBC, which will be employed for a real world problem in a later section. However, this is
not necessarily the case for other learning strategies. For example, Gibbs learning, which has been
extensively examined in the last few decades [7, 11], may be used for the classification. In this,
the classification label for novel data is $y^{\rm Gibbs} = \mathop{\rm argmax}_{y=\pm 1} f\left(\frac{y}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l\right)$, using a
pair of parameters $\boldsymbol{w}$ and $\boldsymbol{c}$ that are randomly selected from the posterior distribution of eq.
(5). The generalization error of this approach can be assessed as

$$ \epsilon_g^{\rm Gibbs} = 2\int Dz \left(1 - \int Dv\, f\left(\sqrt{C_t - \frac{R^2}{Q}}\,v + \frac{R}{\sqrt{Q}}\,z\right)\right) \int Du\, \Theta\left(\sqrt{C-Q}\,u + \sqrt{Q}\,z\right) \tag{15} $$

using $R$ and $Q$. Although such a strategy may seem somewhat similar to using the SBC,
the generalization error of this is not minimized at $C = C_t$, as shown in fig. 1 (b), which
indicates that minimizing $\epsilon_g^{\rm Gibbs}$ is not useful for identifying $C$. This may be a reason why the
determination of hyper parameters was not discussed in the preceding work of ref. 4, in which only
variants of Gibbs learning were examined.
4. Development of a BP-Based Tractable Algorithm

The analysis in the preceding section indicates that the SBC of eq. (6) can provide optimal
performance for a relation that is typically generated from eq. (7). Unfortunately, performing
the SBC exactly is computationally difficult because a high-dimensional integration or
summation with respect to $\boldsymbol{w}$ and $\boldsymbol{c}$ is required. Regarding this difficulty, recent studies [12-14]
have shown that an algorithm termed belief propagation (BP), which was developed in the
information sciences [15, 16], can serve as an excellent approximation algorithm. Therefore, let
us try to construct a practically tractable algorithm for the SBC based on BP.

4.1 Belief Propagation

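For this, we first pictorially represent the posterior distribution of eq. (5) by a complete
bipartite graph, shown in fig. 2. In this figure, two types of nodes stand for the pairs of
parameters $(w_l, c_l)$ (circle) and labels $y^\mu$ (square), while the edges connecting these nodes
denote the components of the data $x_l^\mu$. We approximate the microcanonical prior of eq. (4)
by a factorizable canonical prior as

$$ P(\boldsymbol{w},\boldsymbol{c}) \propto \prod_{l=1}^N \exp\left( -\frac{1}{2}(1 - c_l + G c_l) w_l^2 + \lambda c_l \right), \tag{16} $$

where $G > 0$ and $\lambda$ are adjustable hyper parameters. Then, BP is defined as an algorithm
that updates the two types of function of $(w_l, c_l)$, which are termed messages, as

$$ \hat{\mathcal{M}}^{t+1}_{\mu \to l}(w_l, c_l) = \frac{\int \prod_{j \ne l} dw_j \sum_{c_{j \ne l}} f(\Delta_\mu) \prod_{j \ne l} \mathcal{M}^t_{j \to \mu}(w_j, c_j)}{\int d\boldsymbol{w} \sum_{\boldsymbol{c}} f(\Delta_\mu) \prod_{j \ne l} \mathcal{M}^t_{j \to \mu}(w_j, c_j)}, \tag{17} $$

$$ \mathcal{M}^t_{l \to \mu}(w_l, c_l) = \frac{e^{-\frac{1}{2}(1 - c_l + G^t c_l) w_l^2 + \lambda^t c_l} \prod_{\nu \ne \mu} \hat{\mathcal{M}}^t_{\nu \to l}(w_l, c_l)}{\int dw_l \sum_{c_l} e^{-\frac{1}{2}(1 - c_l + G^t c_l) w_l^2 + \lambda^t c_l} \prod_{\nu \ne \mu} \hat{\mathcal{M}}^t_{\nu \to l}(w_l, c_l)}, \tag{18} $$

between the two types of nodes, where $\Delta_\mu = \frac{y^\mu}{\sqrt{N}}\sum_{l=1}^N c_l w_l x_l^\mu$ and $t$ denotes the number of
updates. $\mathcal{M}^t_{l \to \mu}(w_l, c_l)$ and $\hat{\mathcal{M}}^t_{\mu \to l}(w_l, c_l)$ are approximations at the $t$-th update of the marginal
probability of a cavity system in which a single element of data $(\boldsymbol{x}^\mu, y^\mu)$ is left out from $D^M$
and the effective influence on $(w_l, c_l)$ when $\boldsymbol{x}^\mu$ is newly introduced to the cavity system,
respectively. These provide an approximation of the posterior marginal at each update, which
is termed the belief, given by

$$ P^t(w_l, c_l) = \frac{e^{-\frac{1}{2}(1 - c_l + G^t c_l) w_l^2 + \lambda^t c_l} \prod_{\mu=1}^M \hat{\mathcal{M}}^t_{\mu \to l}(w_l, c_l)}{\int dw_l \sum_{c_l} e^{-\frac{1}{2}(1 - c_l + G^t c_l) w_l^2 + \lambda^t c_l} \prod_{\mu=1}^M \hat{\mathcal{M}}^t_{\mu \to l}(w_l, c_l)}. \tag{19} $$

At each update, the hyper parameters $G^t$ and $\lambda^t$ are determined so that the pruning constraints
$\sum_{l=1}^N c_l = NC$ and $\sum_{l=1}^N c_l w_l^2 = NC$ are satisfied on average with respect to eq. (19), which
is valid for large $N$ due to the law of large numbers.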
Fig. 2. Complete bipartite graph representing the posterior distribution of eq. (5).



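4.2 Gaussian Approximation and Self-Averaging Properties

Exactly evaluating eq. (17) is, unfortunately, still difficult. In order to resolve this difficulty,
we introduce the Gaussian approximation:

$$ \Delta_\mu \simeq \frac{y^\mu x_l^\mu}{\sqrt{N}} c_l w_l + \sqrt{V^t_{\mu \setminus l}}\,u + \langle\Delta_{\mu \setminus l}\rangle^t_\mu, \tag{20} $$

where $u \sim \mathcal{N}(0, 1)$, $\langle\Delta_{\mu \setminus l}\rangle^t_\mu = \frac{1}{\sqrt{N}}\sum_{j \ne l} y^\mu x_j^\mu m^t_{j\mu}$,
$m^t_{j\mu} = \int dw_j \sum_{c_j} (c_j w_j) \mathcal{M}^t_{j \to \mu}(w_j, c_j)$ and $V^t_{\mu \setminus l}$ represents the variance
of $\Delta_{\mu \setminus l} = \frac{1}{\sqrt{N}}\sum_{j \ne l} y^\mu x_j^\mu c_j w_j$. This approximation is likely to hold typically for large $N$ due
to the central limit theorem when the parameters $(w_{j \ne l}, c_{j \ne l})$ are generated from the cavity
distribution $\prod_{j \ne l} \mathcal{M}^t_{j \to \mu}(w_j, c_j)$ in the case when the training data $\boldsymbol{x}^\mu$ are independently
generated from $P_{\rm sph}(\boldsymbol{x})$. Further, we assume that the self-averaging property holds for $V^t_{\mu \setminus l}$,
which implies that $V^t_{\mu \setminus l}$ typically converges to its sample average independently of $\boldsymbol{x}^\mu$ and can
be evaluated using the $t$-th belief of eq. (19) as

$$ V^t_{\mu \setminus l} = \frac{1}{N}\sum_{j \ne l} \left( \langle (c_j w_j)^2 \rangle^t_\mu - (m^t_{j\mu})^2 \right) \simeq \frac{1}{N}\sum_{l=1}^N \left( \langle (c_l w_l)^2 \rangle^t - (m^t_l)^2 \right) = C - Q^t, \tag{21} $$

where $\langle\cdots\rangle^t_\mu$ and $\langle\cdots\rangle^t$ denote averages over the cavity distribution $\prod_{j \ne l} \mathcal{M}^t_{j \to \mu}(w_j, c_j)$ and
the belief of eq. (19), respectively, $m^t_l = \langle c_l w_l \rangle^t$ and $Q^t = \frac{1}{N}\sum_{l=1}^N (m^t_l)^2$. We note that this
property was once assumed to hold in equilibrium for similar systems [18]. We here further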
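assume that this can be extended even to the transient stage in BP dynamics [13, 14]. This
makes it possible to evaluate the message of eq. (17) as

$$ \hat{\mathcal{M}}^{t+1}_{\mu \to l}(w_l, c_l) \propto \exp\left[ \frac{y^\mu x_l^\mu}{\sqrt{N}} a^{t+1}_{\mu l} c_l w_l - \frac{(x_l^\mu)^2}{2N} b^{t+1}_{\mu l} (c_l w_l)^2 \right], \tag{22} $$

where

$$ a^{t+1}_{\mu l} = \frac{\partial}{\partial \langle\Delta_{\mu \setminus l}\rangle^t_\mu} \ln \left[ \int Du\, f\left( \sqrt{C - Q^t}\,u + \langle\Delta_{\mu \setminus l}\rangle^t_\mu \right) \right], \tag{23} $$

$$ b^{t+1}_{\mu l} = -\frac{\partial^2}{\partial \left( \langle\Delta_{\mu \setminus l}\rangle^t_\mu \right)^2} \ln \left[ \int Du\, f\left( \sqrt{C - Q^t}\,u + \langle\Delta_{\mu \setminus l}\rangle^t_\mu \right) \right]. \tag{24} $$

Inserting these into eq. (18) offers the cavity average $m^t_{l\mu}$ as

$$ m^t_{l\mu} = \frac{ \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_{l\mu})^2}{2(F^t + \hat{Q}^t)} \right] }{ 1 + \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_{l\mu})^2}{2(F^t + \hat{Q}^t)} \right] } \times \frac{h^t_{l\mu}}{F^t + \hat{Q}^t}, \tag{25} $$

where $h^t_{l\mu} = \frac{1}{\sqrt{N}}\sum_{\nu \ne \mu} y^\nu x_l^\nu a^t_{\nu l}$ and we have introduced the novel macroscopic parameters

$$ F^t = G^t - \frac{1}{N}\sum_{\mu=1}^M \frac{ \int Du\, f''\left( \sqrt{C - Q^{t-1}}\,u + \langle\Delta_{\mu \setminus l}\rangle^{t-1}_\mu \right) }{ \int Du\, f\left( \sqrt{C - Q^{t-1}}\,u + \langle\Delta_{\mu \setminus l}\rangle^{t-1}_\mu \right) }, \tag{26} $$

$$ \hat{Q}^t = \frac{1}{N}\sum_{\mu=1}^M \left( \frac{ \int Du\, f'\left( \sqrt{C - Q^{t-1}}\,u + \langle\Delta_{\mu \setminus l}\rangle^{t-1}_\mu \right) }{ \int Du\, f\left( \sqrt{C - Q^{t-1}}\,u + \langle\Delta_{\mu \setminus l}\rangle^{t-1}_\mu \right) } \right)^2, \tag{27} $$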



assuming the further self-averaging property 



AT 



Ai 



- 1 E 



d 



n=i 



t-i 



M=l 



In 



= r 



(28) 
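In eq. (25), the adjustable hyper parameters $F^t$ and $\lambda^t$ are determined so that the average
pruning conditions

$$ \frac{1}{N}\sum_{l=1}^N \langle c_l w_l^2 \rangle^t = \frac{1}{N}\sum_{l=1}^N \frac{ \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] }{ 1 + \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] } \left[ \frac{1}{F^t + \hat{Q}^t} + \frac{(h^t_l)^2}{(F^t + \hat{Q}^t)^2} \right] = C, \tag{29} $$

$$ \frac{1}{N}\sum_{l=1}^N \langle c_l \rangle^t = \frac{1}{N}\sum_{l=1}^N \frac{ \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] }{ 1 + \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] } = C \tag{30} $$

hold for eq. (19) at each update, where $h^t_l = \frac{1}{\sqrt{N}}\sum_{\mu=1}^M y^\mu x_l^\mu a^t_{\mu l}$.

Notice that eqs. (23)-(30) can be used as a computationally tractable algorithm for
assessing the SBC. For this, we evaluate the posterior average $m^t_l = \langle c_l w_l \rangle^t$, plugging eq. (22)
into eq. (19) using eqs. (23) and (24), which provides

$$ m^t_l = \frac{ \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] }{ 1 + \frac{1}{\sqrt{F^t + \hat{Q}^t}} \exp\left[ \lambda^t + \frac{(h^t_l)^2}{2(F^t + \hat{Q}^t)} \right] } \times \frac{h^t_l}{F^t + \hat{Q}^t}. \tag{31} $$

This makes it possible to evaluate the SBC of eq. (6) using the Gaussian approximation as

$$ y = {\rm sign}\left( \int Du\, f\left( \sqrt{C - Q^t}\,u + \langle\Delta\rangle^t \right) - \frac{1}{2} \right) \tag{32} $$

for an element of data $\boldsymbol{x}$ that is newly generated from $P_{\rm sph}(\boldsymbol{x})$ at each update.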
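The single-site quantities entering the pruning conditions and the posterior average can be sketched as follows. This is a simplified illustration, not the full update: `s` stands for the combination $F^t + \hat{Q}^t$ and is held fixed, and only $\lambda^t$ is tuned by bisection so that the average inclusion weight equals $C$, whereas the scheme described in the text adjusts $F^t$ and $\lambda^t$ jointly to satisfy both conditions. The function names and the toy local fields are assumptions.

```python
import math

def inclusion_weight(h, s, lam):
    """Weight of c_l = 1: E / (1 + E), E = exp(lam + h^2 / (2 s)) / sqrt(s)."""
    log_e = lam + h * h / (2.0 * s) - 0.5 * math.log(s)
    if log_e >= 0:  # numerically stable logistic function
        return 1.0 / (1.0 + math.exp(-log_e))
    e = math.exp(log_e)
    return e / (1.0 + e)

def posterior_mean(h, s, lam):
    """Posterior average of c_l * w_l: inclusion weight times h / s."""
    return inclusion_weight(h, s, lam) * h / s

def solve_lambda(fields, s, target, lo=-50.0, hi=50.0, iters=100):
    """Bisection on lam so that the mean inclusion weight equals target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        avg = sum(inclusion_weight(h, s, mid) for h in fields) / len(fields)
        if avg < target:
            lo = mid  # weight is increasing in lam, so move the lower bound up
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical local fields h_l and parameters, for illustration only.
fields = [0.1, -2.0, 0.5, 3.0, -0.2]
lam = solve_lambda(fields, s=1.5, target=0.4)
means = [posterior_mean(h, 1.5, lam) for h in fields]
```

Components with a large $|h_l|$ receive an inclusion weight close to one, while those with a small field are effectively pruned, which is the mechanism that reduces the effective dimension.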

4.3 Further Reduction of Computational Cost

The necessary cost of performing the above procedure is $O(N^2 M)$ per update, which can
be further reduced. In order to save the computational cost, we represent $\langle\Delta_{\mu \setminus l}\rangle^t_\mu$ and $h^t_{l\mu}$ as



(A V A* ~(A )*- J-YVz^ ( V -^a} 



— — L m;_ 1(1 ~ 



M 

^=1 



t -i 



N 



N 



AT 



(33) 



A/ 



t^t-i y^ x i t 



N 



(34) 



using the singly- indexed variables mj, 



dln\f MVC-Q'-HiA,)^ 1 )] 



, and (A M )* = 



j= X^z=i y^xfmj, where 0* = -7= S M =i y^ x i a li anc ^ 3* = C — Q*. Using these, the algo- 
(35)



din 



m l 



1 + 



exp 



2(F'+Q J ) 



(36) 



This version may be useful in analyzing relatively higher dimensional data, as the
computational cost per update is reduced from $O(N^2 M)$ to $O(NM)$.

4.4 Performance Analysis and Link to the RS Solution

To investigate the performance of the BP-based algorithm, let us describe its behavior
using macroscopic variables, such as $Q^t = \frac{1}{N}\sum_{l=1}^N (m^t_l)^2$ and $R^t = \frac{1}{N}\sum_{l=1}^N w_{0l} m^t_l$, in the
thermodynamic limit $N, M \to \infty$, keeping $\alpha \sim O(1)$. We will perform the analysis based on
the naive expression of the algorithm in eqs. (23)-(30), since eqs. (35) and (36) are just a
cost-saving version of this naive algorithm and, therefore, their behavior is identical.

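For this purpose, we first assume that the self-averaging properties

$$ \frac{1}{N} \boldsymbol{m}^t_\mu \cdot \boldsymbol{m}^t_\mu \simeq \frac{1}{N} \boldsymbol{m}^t \cdot \boldsymbol{m}^t = Q^t, \tag{37} $$

$$ \frac{1}{N} \boldsymbol{w}_0 \cdot \boldsymbol{m}^t_\mu \simeq \frac{1}{N} \boldsymbol{w}_0 \cdot \boldsymbol{m}^t = R^t \tag{38} $$

hold for the macroscopic variables. That the training data $\boldsymbol{x}^\mu$ are independently drawn from
$P_{\rm sph}(\boldsymbol{x})$, in conjunction with the central limit theorem, implies that the pair of $\langle\Delta_{\mu \setminus l}\rangle^t_\mu$ and
the teacher's stability $\Delta^0_\mu = \frac{1}{\sqrt{N}}\sum_{l=1}^N x_l^\mu w_{0l}$ can be treated as zero-mean Gaussian random
variables, the variances and covariance of which are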
w a ■ w 



[(A 



o\21 



\ D M 



N 



C u 



D M 



J D M 



(39) 



This makes it possible to represent these variables as 



A^ — 



(40) 
(41) 



using three independent Gaussian random variables $u, v, z \sim \mathcal{N}(0, 1)$, which, in conjunction
with the self-averaging property, indicates that the macroscopic properties of the cavity field
$h^t_{l\mu} = \frac{1}{\sqrt{N}}\sum_{\nu \ne \mu} y^\nu x_l^\nu a^t_{\nu l}$ can be characterized independently of $l$ and $\mu$ as
Q t VQ 1 J f Duf (yc -Q t u+ v^z)



(43) 



where the prefactor 2 in eqs. (42) and (43) has its root in the two possibilities of the label
$y = \pm 1$. On the other hand, these equations mean that the cavity fields, which can also be
treated as Gaussian random variables since $x_l^\mu$ are zero-mean and almost uncorrelated random
variables, can be represented as



(44) 



where $z \sim \mathcal{N}(0, 1)$, which yields

Fig. 3. Generalization error of the SBC provided by the BP-based algorithm of eqs. (35) and (36) after the $t$-th
update. Markers were obtained from 100 experiments for a transfer function $f(u; \kappa = 0.05) =
0.05 + 0.9\,\Theta(u)$ in the case of $C_t = 0.2$, varying $C = 0.1$, $0.2$ and $0.3$. Error bars indicate 95%
confidence intervals. Lines represent the theoretical prediction assessed by eqs. (42)-(48), which
exhibits excellent consistency with the experimental results.



Dz 



cxp 



2(F*+Q*) 



1 + 



■■ cxp 



A* + 



2(F*+Q*) 

where $F^t$ and $\lambda^t$ are determined so that the pruning conditions
(47)
(48) 



y/F t +Q t 

hold for the t-th update. 

In fig. 3, the experimentally obtained time evolution of the BP-based algorithm of eqs. (35) and (36)
is compared with its theoretical prediction assessed by eqs. (42)-(48), which exhibits excellent
consistency. This validates the macroscopic analysis provided above. It is worth noting that
the stationary conditions of eqs. (42)-(48) are identical to those of the saddle point of the RS
free energy of eq. (13). This implies that the BP-based algorithm provides a nearly optimal
solution in a practical time scale in the assumed ideal situations of large system size if the hyper
parameter $C_t$ is correctly estimated.

J. Phys. Soc. Jpn.

Full Paper

Fig. 4. Comparison between $\epsilon_g^{\rm SBC}$ (o) and $\epsilon_{\rm LOOE}$ (+) for $C_t = 0.1$, $\kappa = 0.05$ and $\alpha = 1$. $\epsilon_g^{\rm SBC}$ is
evaluated by replica analysis under the replica symmetric ansatz (line), while $\epsilon_{\rm LOOE}$ is experimentally
obtained using eqs. (35) and (36) for $N = M = 2000$.

The replica analysis in the previous section indicates that $C_t$ can be estimated by
minimizing the generalization error $\epsilon_g^{\rm SBC}$. The leave-one-out error (LOOE), which is represented
as

$$ \epsilon_{\rm LOOE} = \frac{1}{M}\sum_{\mu=1}^M \Theta\left( \frac{1}{2} - \int Du\, f\left( \sqrt{C - Q}\,u + \langle\Delta_\mu\rangle_\mu \right) \right) \tag{49} $$

in the current case, is frequently used as an estimate of $\epsilon_g^{\rm SBC}$ for practical applications, since it
can be evaluated from only the training set $D^M$. The algorithm is also useful for assessing the
LOOE of eq. (49), since this computes all the cavity stabilities $\langle\Delta_\mu\rangle_\mu$ at each update, which
saves the cost of relearning in assessing the LOOE [18]. Fig. 4 shows a comparison between $\epsilon_g^{\rm SBC}$
and $\epsilon_{\rm LOOE}$ for an activation function, as given in eq. (14), in the case of $N = M = 2000$. It
shows that the LOOE (49) can be used in practical applications for determining the necessary
hyper parameters using only given data.
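Once the cavity stabilities are available, the leave-one-out estimate amounts to counting the training examples whose left-out prediction is wrong. The sketch below assumes the activation function of eq. (14), for which the predictive probability of the correct label is $\kappa + (1-2\kappa)\Phi(\Delta/\sqrt{C-Q})$ with $\Phi$ the standard normal distribution function, and it assumes that each cavity stability already includes the factor $y^\mu$. The function names and the input values are hypothetical.

```python
import math

def _std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_correct(cavity_stability, C, Q, kappa):
    """Predictive probability of the true label of a left-out example."""
    arg = cavity_stability / math.sqrt(C - Q)
    return kappa + (1.0 - 2.0 * kappa) * _std_normal_cdf(arg)

def loo_error(cavity_stabilities, C, Q, kappa):
    """Fraction of examples whose left-out prediction falls below 1/2."""
    errors = sum(1 for d in cavity_stabilities
                 if prob_correct(d, C, Q, kappa) < 0.5)
    return errors / len(cavity_stabilities)

# Hypothetical label-signed cavity stabilities for four training examples.
estimate = loo_error([0.3, -0.1, 1.2, -2.0], C=0.2, Q=0.1, kappa=0.05)
```

Because the cavity stabilities are a by-product of the message passing, this estimate needs no retraining, which is the saving mentioned in the text.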

The BP-based algorithm of eqs. (35) and (36) is developed and analyzed under the
self-averaging assumption, which is valid when each sample is independently drawn from the
spherical distribution $P_{\rm sph}(\boldsymbol{x})$. Unfortunately, raw real world data do not necessarily obey such
distributions, which may deteriorate the approximation accuracy of the developed algorithm.
A simple approach to handle this problem is to make the statistical properties of the data set
closer to those of samples from $P_{\rm sph}(\boldsymbol{x})$ by a linear transformation so that $\frac{1}{M}\sum_{\mu=1}^M x_l^\mu = 0$
and $\frac{1}{M}\sum_{\mu=1}^M x_l^\mu x_j^\mu = \delta_{lj}$ hold while keeping $|\boldsymbol{x}^\mu|$ fixed to the square root of the dimensionality.
Such an approach is often termed whitening, the efficacy of which will be experimentally
examined in the next section.

5. Application to a Real World Problem 

To examine the practical significance of the SBC, we applied it to a real world problem.
We considered the task of distinguishing cancer from normal tissue using microarray data of
$N = 2000$ dimensions [3]. The data was sampled from $M = 62$ tissues, 20 and 42 tissues of
which were classified as normal and cancerous, respectively.

Table I. Classification result

        generalization error (%)    standard deviation
SBC     23.5                        10.2
SBL     24.3                        10.7
FDA     39.6                        8.39

We employed eq. (14) as the activation function. The data set $D^M = \{(\boldsymbol{x}^\mu, y^\mu)\}$ [19] was
pre-processed so that $\frac{1}{M}\sum_{\mu=1}^M x_l^\mu = 0$ and $|\boldsymbol{x}^\mu| = \sqrt{N}$ held. The data set was randomly divided
into training and test sets, which were composed of 42 and 20 tissues, respectively. For a given
training set, the hyper parameters $C$ and $\kappa$ were determined from the possibilities of $C \in
\{2.5 \times 10^{-3}, 5.0 \times 10^{-3}, 1.0 \times 10^{-2}, 5.0 \times 10^{-2}, 0.1, 0.2, 0.4, 0.6, 0.8\}$ and $\kappa \in \{0.1, 0.2, 0.3, 0.4\}$,
so as to minimize eq. (49) for the training set. After determining $C$ and $\kappa$, the generalization
error was measured for the test set. We repeated this experiment 200 times, redividing the
data set.
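The pre-processing step can be sketched as follows, under the assumption that the components are first centered over the samples and each sample is then rescaled to norm $\sqrt{N}$. The function name and the toy data are hypothetical; since rescaling after centering can slightly disturb the zero-mean condition, enforcing both conditions exactly may require alternating the two steps, which is not shown.

```python
import math

def preprocess(samples):
    """Center every component over the samples, then scale each sample to |x| = sqrt(N)."""
    m = len(samples)
    n = len(samples[0])
    means = [sum(row[l] for row in samples) / m for l in range(n)]
    centered = [[row[l] - means[l] for l in range(n)] for row in samples]
    result = []
    for row in centered:
        norm = math.sqrt(sum(v * v for v in row))
        scale = math.sqrt(n) / norm if norm > 0 else 0.0
        result.append([v * scale for v in row])
    return result

# Toy data: M = 3 samples in N = 3 dimensions (hypothetical values).
raw = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [4.0, 5.0, 9.0]]
processed = preprocess(raw)
```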

The results are shown in table I. The conventional Fisher discriminant analysis (FDA) [20]
and sparse Bayesian learning (SBL) [21], which selects a sparse model using a certain prior,
termed the automatic relevance determination (ARD) prior [22, 23], are presented for comparison.
It is apparent that the FDA, which does not have a mechanism to reduce effective dimensions,
exhibits a significantly lower generalization ability than the other two schemes. This is also
supported by Welch's test, which is a standard method to examine the statistical significance of
the difference of averages between two groups, although the standard deviation of FDA is smaller
than those of the others. On the other hand, although the average generalization error of the
SBC is smaller than that of SBL, the standard deviations are large, which prevents us from
clearly judging the superiority of the SBC.
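The Welch statistic for the FDA versus SBC comparison can be recomputed from the entries of table I. The sketch assumes that the tabulated standard deviations are sample standard deviations over the 200 repetitions for both methods; the function name is hypothetical.

```python
import math

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = sd_a ** 2 / n_a
    vb = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, dof

# FDA versus SBC with the values of table I and 200 repetitions each.
t_stat, dof = welch_t(39.6, 8.39, 200, 23.5, 10.2, 200)
```

Under these assumptions the statistic is roughly 17, far beyond conventional critical values, which is consistent with the statement that FDA is significantly worse.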

In order to resolve this difficulty, we examined how many times the SBC provided a
smaller generalization error than SBL in the 200 experiments. The number of times that the
SBC offered smaller, equal and larger errors than SBL were 99, 36 and 65, respectively. A
one-sided binomial test was applied to this result under the null hypothesis that there is no
difference of the generalization ability between SBC and SBL, ignoring the tie data, which
yields

$$ \frac{99 - (200 - 36) \times \frac{1}{2}}{\sqrt{(200 - 36) \times \frac{1}{2} \times \frac{1}{2}}} = 2.65\cdots > u(0.05) = 1.64\cdots $$

under the normal approximation. This
implies that the difference between SBC and SBL is statistically significant with a confidence
level of 95% and, therefore, the SBC has a higher generalization ability.
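The arithmetic of this test is easy to reproduce. The sketch below evaluates the normal approximation to the one-sided sign test from the win and loss counts quoted in the text, with ties discarded; the function name is hypothetical.

```python
import math

def sign_test_z(wins, losses):
    """Normal approximation to the binomial sign test with p = 1/2 (ties dropped)."""
    n = wins + losses
    return (wins - 0.5 * n) / math.sqrt(n * 0.5 * 0.5)

# SBC had the smaller error 99 times and the larger error 65 times.
z = sign_test_z(99, 65)
```

Here $n = 164$, so the statistic equals $17/\sqrt{41}$, which exceeds the one-sided 5% critical value of about 1.64.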

Histograms of selected values of $C$ and $\kappa$ are shown in figs. 5 and 6, respectively. They
indicate that $\kappa$ has a statistically greater fluctuation. For reference, we performed experiments
fixing $\kappa$ to the most frequent value, $\kappa = 0.4$, which reduced the average and standard deviation
of the generalization error of SBC to 21.0(%) and 9.72, respectively. This determination by
the resampling technique is an alternative scheme for estimating hyper parameters. Although
performing it naively requires a greater computational cost than minimizing the LOOE, a
recently proposed analytical approximation method [24] may be promising for reducing
computational cost, and will be the subject of future work.

Fig. 5. Histogram of selected $C$

In the algorithm we have developed, we assumed a self-averaging property. This
assumption may give a good match with whitened data, i.e., data for which the dimensionality is
reduced from $N = 2000$ to $M = 62$ by a linear transformation so that $\frac{1}{M}\sum_{\mu=1}^M x_l^\mu = 0$ and
$\frac{1}{M}\sum_{\mu=1}^M x_l^\mu x_j^\mu = \delta_{lj}$ hold in the reduced space. We also carried out the above experiments
for the whitened data fixing $\kappa$ to 0.4, finding that the average and standard deviation of the
generalization error of the SBC were reduced to 16.3(%) and 6.20, respectively. However, such
an approach may not be preferred because it becomes difficult to interpret the implications
of the result as the original meaning of variables is lost by the linear transformation.

6. Summary 

In summary, we have developed a classifier termed the sparse Bayesian classifier (SBC) that
eliminates irrelevant components in high-dimensional data $\boldsymbol{x} \in \mathbb{R}^N$ by multiplying discrete
variables $c_l \in \{0, 1\}$, $1 \le l \le N$, for each dimension $l$ following the Bayesian framework. The
efficacy of the SBC was confirmed by the replica method for the target rules of a certain type.
Unfortunately, exactly evaluating the SBC is computationally difficult. In order to resolve
this difficulty, we have also developed a computationally tractable approximation algorithm
for the SBC based on belief propagation (BP). It turns out that the developed BP-based
algorithm provides a result consistent with that of replica analysis for ideal situations, which
implies that a nearly optimal performance can be obtained in a practical time scale in such
ideal cases. Finally, the significance of the SBC to real world applications was experimentally
validated for a problem of colon cancer classification.

Fig. 6. Histogram of selected $\kappa$

In this paper, the classifier was developed for minimizing the generalization error. Identi- 
fying relevant components from a given data set may be another purpose of the classification 
analysis. Designing a classifier for this purpose is currently under way. 